AITopics | completion rate

completion rate

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

AdaSociety: An Adaptive Environment with Social Structures for Multi-Agent Decision-Making

Huang, Yizhe, Wang, Xingbo, Liu, Hao, Kong, Fanqi

Neural Information Processing Systems | Feb-11-2026, 16:21:23 GMT

Traditional interactive environments limit agents' intelligence growth with fixed

artificial intelligence, deep learning, machine learning, (16 more...)

Neural Information Processing Systems

Country:

Europe > Sweden > Skåne County > Malmö (0.04)
North America > United States > Montana (0.04)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report (0.46)

Industry:

Education (0.67)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

E-MAPP: Efficient Multi-Agent Reinforcement Learning with Parallel Program Guidance

Neural Information Processing Systems | Feb-8-2026, 21:29:29 GMT

The agents often have difficulties in cooperating on common goals, dividing complex tasks, and planning through several stages to make progress.

agent, artificial intelligence, machine learning, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents [Technical Report]

Peeters, Ralph, Steiner, Aaron, Schwarz, Luca, Caspary, Julian Yuya, Bizer, Christian

arXiv.org Artificial Intelligence | Dec-3-2025

LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the user's needs. Benchmarks for evaluating web agents either require agents to perform tasks online using the live Web or offline using simulated environments, which allow for the exact reproduction of the experimental setup. While DeepShop provides an online benchmark that requires agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, or Mind2Web cover only comparatively simple e-commerce tasks that need to be performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex tasks. We fill this gap by introducing WebMall, the first offline multi-shop benchmark for evaluating web agents on challenging comparison shopping tasks. WebMall consists of four simulated shops populated with product data extracted from the Common Crawl. The WebMall tasks range from specific product searches and price comparisons to advanced queries for complementary or substitute products, as well as checkout processes. We validate WebMall using eight agents that differ in observation space, availability of short-term memory, and the employed LLM. The validation highlights the difficulty of the benchmark, with even the best-performing agents achieving task completion rates below 55% in the task categories cheapest product search and vague product search.
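
A task completion rate of the kind reported in this abstract is the share of attempted tasks that an agent finishes successfully, usually broken down by task category. The following is a minimal sketch, assuming a hypothetical list of (category, succeeded) records. The two category names come from the abstract; the function name and the outcomes are placeholders for illustration only and are not WebMall results.

```python
from collections import defaultdict


def completion_rates(results):
    """Return {category: fraction of tasks completed} for (category, succeeded) pairs."""
    totals = defaultdict(int)
    completed = defaultdict(int)
    for category, succeeded in results:
        totals[category] += 1
        if succeeded:
            completed[category] += 1
    return {category: completed[category] / totals[category] for category in totals}


# Placeholder outcomes (illustrative only, not measured data).
example_results = [
    ("cheapest product search", True),
    ("cheapest product search", False),
    ("vague product search", True),
    ("vague product search", True),
    ("vague product search", False),
    ("vague product search", False),
]

rates = completion_rates(example_results)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%}")
```

Reporting the rate per category rather than as one pooled number keeps a large, easy category from masking a small, hard one.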

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2508.13024

Genre:

Workflow (0.68)
Overview (0.68)
Research Report (0.50)

Industry: Information Technology > Services > e-Commerce Services (0.55)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(2 more...)

3e4d8407cb468850f2f8f4a949e64bf0-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Oct-10-2025, 00:05:53 GMT

adasociety, agent, social structure, (13 more...)

Neural Information Processing Systems

Country:

Europe > Sweden > Skåne County > Malmö (0.04)
North America > United States > New York (0.04)
North America > United States > Montana (0.04)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report (0.46)

Industry:

Education (0.67)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (0.93)
(3 more...)

The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models

He, Muyu, Shafique, Muhammad Ali, Kumar, Anand, Mackey, Tsach, Rajani, Nazneen

arXiv.org Artificial Intelligence | Oct-8-2025

Distilling the thinking traces of a Large Language Model (LLM) with reasoning capabilities into a smaller model has been proven effective. Yet, there is a scarcity of work done on how model performance scales with the quantity of distillation data. In this work, we study the scaling trend of distilling competitive coding skills on two small non-reasoning LLMs. We validate the hypothesis that there is a "valley of code reasoning": downstream performance on competitive coding first drops as data quantity increases, then it steadily increases in a sharper-than-log-linear fashion. Having identified the trend, we further fine-tune the models at two different distillation stages on the same data to ground conclusions on their respective learning phases. We learn that across stages in the low and medium-low data regimes, small models benefit significantly more from easier coding questions than from harder ones. We also find that, surprisingly, the correctness of outputs in training data makes no difference to distillation outcomes. Our work represents a step forward in understanding the training dynamics of code reasoning distillation outside intuition.

large language model, machine learning, qwen2, (16 more...)

arXiv.org Artificial Intelligence

2510.06101

Genre: Research Report (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems

Luong, Phung Duc, Bao, Le Tran Gia, Tam, Nguyen Vu Khai, Khoa, Dong Huu Nguyen, Quyen, Nguyen Huu, Pham, Van-Hau, Duy, Phan The

arXiv.org Artificial Intelligence | Sep-17-2025

This work introduces xOffense, an AI-driven, multi-agent penetration testing framework that shifts the process from labor-intensive, expert-driven manual efforts to fully automated, machine-executable workflows capable of scaling seamlessly with computational infrastructure. At its core, xOffense leverages a fine-tuned, mid-scale open-source LLM (Qwen3-32B) to drive reasoning and decision-making in penetration testing. The framework assigns specialized agents to reconnaissance, vulnerability scanning, and exploitation, with an orchestration layer ensuring seamless coordination across phases. Fine-tuning on Chain-of-Thought penetration testing data further enables the model to generate precise tool commands and perform consistent multi-step reasoning. We evaluate xOffense on two rigorous benchmarks: AutoPenBench and AI-Pentest-Benchmark. The results demonstrate that xOffense consistently outperforms contemporary methods, achieving a sub-task completion rate of 79.17%, decisively surpassing leading systems such as VulnBot and PentestGPT. These findings highlight the potential of domain-adapted mid-scale LLMs, when embedded within structured multi-agent orchestration, to deliver superior, cost-efficient, and reproducible solutions for autonomous penetration testing.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2509.13021

Country: Asia > Vietnam (0.14)

Genre: Research Report > New Finding (0.48)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (0.68)
Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Empowering Clinical Trial Design through AI: A Randomized Evaluation of PowerGPT

Lu, Yiwen, Li, Lu, Zhang, Dazheng, Jian, Xinyao, Wang, Tingyin, Chen, Siqi, Lei, Yuqing, Tong, Jiayi, Xi, Zhaohan, Chu, Haitao, Luo, Chongliang, Ogdie, Alexis, Athey, Brian, Turan, Alparslan, Abramoff, Michael, Cappelleri, Joseph C, Xu, Hua, Lu, Yun, Berlin, Jesse, Sessler, Daniel I., Asch, David A., Jiang, Xiaoqian, Chen, Yong

arXiv.org Artificial Intelligence | Sep-17-2025

Sample size calculations for power analysis are critical for clinical research and trial design, yet their complexity and reliance on statistical expertise create barriers for many researchers. We introduce PowerGPT, an AI-powered system integrating large language models (LLMs) with statistical engines to automate test selection and sample size estimation in trial design. In a randomized trial to evaluate its effectiveness, PowerGPT significantly improved task completion rates (99.3% vs. 88.9% for test selection, 99.3% vs. 77.8% for sample size calculation) and accuracy (94.1% vs. 55.4% in sample size estimation, p < 0.001), while reducing average completion time (4.0 vs. 9.3 minutes, p < 0.001). These gains were consistent across various statistical tests and benefited both statisticians and non-statisticians, bridging expertise gaps. Already under deployment across multiple institutions, PowerGPT represents a scalable AI-driven approach that enhances accessibility, efficiency, and accuracy in statistical power analysis for clinical research.
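
To make concrete what a sample size calculation involves, here is a minimal sketch of one textbook case: the per-group sample size for a two-sided, two-sample comparison of means under the normal approximation, n = 2((z_{1-alpha/2} + z_{power}) * sigma / delta)^2. This is not PowerGPT's implementation, which the abstract does not describe; the function name, the default alpha and power, and the example effect size are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_two_means(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z-test comparing means.

    delta: smallest difference in means worth detecting (assumed input).
    sigma: common standard deviation of the outcome (assumed input).
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n = 2 * ((z_alpha + z_power) * sigma / delta) ** 2
    return math.ceil(n)  # round up so the target power is still met


# Hypothetical inputs: detect a difference of half a standard deviation.
n_per_group = sample_size_two_means(delta=0.5, sigma=1.0)
print(n_per_group)  # 63
```

Halving the detectable difference roughly quadruples the required sample, which is one reason the choice of effect size dominates trial planning. A t-based calculation gives a slightly larger n for small samples.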

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2509.12471

Country:

North America > United States > Pennsylvania (0.30)
North America > United States > Texas (0.28)
North America > United States > Iowa (0.28)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Government > Regional Government > North America Government > United States Government (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

CausalMACE: Causality Empowered Multi-Agents in Minecraft Cooperative Tasks

Chai, Qi, Zheng, Zhang, Ren, Junlong, Ye, Deheng, Lin, Zichuan, Wang, Hao

arXiv.org Artificial Intelligence | Aug-27-2025

Minecraft, as an open-world virtual interactive environment, has become a prominent platform for research on agent decision-making and execution. Existing works primarily adopt a single Large Language Model (LLM) agent to complete various in-game tasks. However, for complex tasks requiring lengthy sequences of actions, single-agent approaches often face challenges related to inefficiency and limited fault tolerance. Despite these issues, research on multi-agent collaboration remains scarce. In this paper, we propose CausalMACE, a holistic causality planning framework designed to enhance multi-agent systems, in which we incorporate causality to manage dependencies among subtasks. Technically, our proposed framework introduces two modules: an overarching task graph for global task planning and a causality-based module for dependency management, where inherent rules are adopted to perform causal intervention. Experimental results demonstrate our approach achieves state-of-the-art performance in multi-agent cooperative tasks of Minecraft.
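
The idea of a task graph that manages dependencies among subtasks can be sketched with a plain topological ordering. The example below is only an illustration of dependency-aware scheduling, not the CausalMACE framework or its causal intervention module; the subtask names and the `parallel_batches` helper are hypothetical. It groups subtasks into batches that could be handed to different agents at the same time because all of their prerequisites are already finished.

```python
from graphlib import TopologicalSorter

# Hypothetical subtask dependencies: each key depends on the subtasks in its set.
dependencies = {
    "collect_wood": set(),
    "collect_wool": set(),
    "craft_planks": {"collect_wood"},
    "craft_sticks": {"craft_planks"},
    "craft_bed": {"craft_planks", "collect_wool"},
    "craft_pickaxe": {"craft_planks", "craft_sticks"},
}


def parallel_batches(deps):
    """Group subtasks into batches whose prerequisites are all complete."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


batches = parallel_batches(dependencies)
for step, batch in enumerate(batches, start=1):
    print(step, batch)
```

With these placeholder dependencies the two collection subtasks form the first batch, so two agents could start immediately instead of one agent working through the list in sequence.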

agent, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2508.18797

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.66)

Industry:

Leisure & Entertainment > Games > Computer Games (0.93)
Materials > Metals & Mining (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Real-World Receptivity to Adaptive Mental Health Interventions: Findings from an In-the-Wild Study

Sahu, Nilesh Kumar, Sneh, Aditya, Gupta, Snehil, Lone, Haroon R

arXiv.org Artificial Intelligence | Aug-6-2025

The rise of mobile health (mHealth) technologies has enabled real-time monitoring and intervention for mental health conditions using passively sensed smartphone data. Building on these capabilities, Just-in-Time Adaptive Interventions (JITAIs) seek to deliver personalized support at opportune moments, adapting to users' evolving contexts and needs. Although prior research has examined how context affects user responses to generic notifications and general mHealth messages, relatively little work has explored its influence on engagement with actual mental health interventions. Furthermore, while much of the existing research has focused on detecting when users might benefit from an intervention, less attention has been paid to understanding receptivity, i.e., users' willingness and ability to engage with and act upon the intervention. In this study, we investigate user receptivity through two components: acceptance (acknowledging or engaging with a prompt) and feasibility (ability to act given situational constraints). We conducted a two-week in-the-wild study with 70 students using a custom Android app, LogMe, which collected passive sensor data and active context reports to prompt mental health interventions. The adaptive intervention module was built using Thompson Sampling, a reinforcement learning algorithm. We address four research questions relating smartphone features and self-reported contexts to acceptance and feasibility, and examine whether an adaptive reinforcement learning approach can optimize intervention delivery by maximizing a combined receptivity reward. Our results show that several types of passively sensed data significantly influenced user receptivity to interventions. Our findings contribute insights into the design of context-aware, adaptive interventions that are not only timely but also actionable in real-world settings.
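
Thompson Sampling, named in this abstract, can be illustrated with its simplest form: a Beta-Bernoulli bandit that keeps a Beta posterior per option, draws one sample from each posterior, and picks the option with the largest draw. The sketch below is not the LogMe intervention module, whose details the abstract does not give. The class name, the three "arms", the binary reward, and the simulated response probabilities are all assumptions made for illustration.

```python
import random


class BetaBernoulliThompson:
    """Thompson Sampling for binary rewards with a Beta(1, 1) prior per arm."""

    def __init__(self, n_arms, seed=None):
        self.successes = [1] * n_arms  # Beta alpha parameters
        self.failures = [1] * n_arms   # Beta beta parameters
        self.rng = random.Random(seed)

    def select_arm(self):
        # Draw one plausible success rate per arm and act greedily on the draws.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # A reward of 1/True counts as a success, anything falsy as a failure.
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


# Simulated environment with made-up response probabilities (not study data).
true_rates = [0.2, 0.5, 0.8]
env_rng = random.Random(0)
agent = BetaBernoulliThompson(n_arms=len(true_rates), seed=1)
pulls = [0] * len(true_rates)

for _ in range(2000):
    arm = agent.select_arm()
    reward = env_rng.random() < true_rates[arm]
    agent.update(arm, reward)
    pulls[arm] += 1

print(pulls)
```

Because each arm is chosen in proportion to how plausible it is that the arm is best, exploration fades on its own as evidence accumulates, with no separate exploration rate to tune.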

intervention, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2508.02817

Country:

North America > United States (0.28)
Asia > India > Madhya Pradesh (0.15)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.88)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)

Technology:

Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
